from PIL import Image
import requests
image_path = r'C:\Users\prk\Desktop\Staten Island.PNG'  # raw string: avoids '\U' escape errors
nyc_image = Image.open(image_path)  # local file, so open it directly rather than via requests
nyc_image
New York City (NYC) is the most populous city in the United States. With an estimated 2019 population of 8,336,817 distributed over about 302.6 square miles (784 km2), New York is also the most densely populated major city in the United States. The city is the center of the New York metropolitan area, the largest metropolitan area in the world by urban landmass. With almost 20 million people in its metropolitan statistical area and approximately 23 million in its combined statistical area, it is one of the world's most populous megacities.
Staten Island is the least populated of the 5 boroughs of New York City, with an estimated population of 476,143 in 2019, but it is the third-largest in land area at 58.5 sq mi (152 km2).
In such a large borough with the least population, finding a suitable location to open a restaurant is a daunting task. In this project, we will explore the neighbourhoods of Staten Island, find which cuisines are popular in each neighbourhood, and understand its demographic trends.
This analysis can be used to find a suitable neighbourhood for a restaurant. It can also help a tourist travelling to Staten Island, New York City choose which neighbourhoods to visit, or a family planning to move to Staten Island decide which neighbourhood is best suited for them.
New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we will essentially need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.
The New York City Department of City Planning published this data at https://geo.nyu.edu/catalog/nyu_2451_34572
Foursquare Developers Access to venue data: https://foursquare.com/developers/apps
Foursquare is a location data provider, which will be used to make RESTful API calls to retrieve data about restaurants in different neighborhoods.
Import all required libraries and download the New York City dataset, which contains a lot of information about New York City. Analyze and clean this data to pull the borough, neighbourhood, latitude, and longitude fields. The data lives in a features dictionary, so transform it into a pandas dataframe by looping through the whole dataset to pull all required fields. Once done, the data will look like below.
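The transformation loop might look like the sketch below. Here `newyork_data` is a single hand-made stand-in feature so the cell runs standalone; the real dictionary comes from the NYU geodata link above, with field names following the nyu_2451_34572 layout.

```python
import pandas as pd

# Stand-in for the JSON downloaded from the NYU geodata link; the real
# file contains all 306 features.
newyork_data = {'features': [
    {'properties': {'borough': 'Staten Island', 'name': 'St. George'},
     'geometry': {'coordinates': [-74.0765, 40.6444]}},
]}

# loop through every feature and pull borough, neighborhood, and coordinates
rows = []
for feature in newyork_data['features']:
    lon, lat = feature['geometry']['coordinates']  # GeoJSON stores (lon, lat)
    rows.append({'Borough': feature['properties']['borough'],
                 'Neighborhood': feature['properties']['name'],
                 'Latitude': lat,
                 'Longitude': lon})

neighborhoods = pd.DataFrame(rows, columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude'])
neighborhoods.head()
```

Note the coordinate swap: GeoJSON lists longitude first, while the dataframe keeps the conventional latitude/longitude column order.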
The dataframe consists of all 5 boroughs of New York City, but we are exploring only Staten Island, so filter the data to extract only Staten Island neighbourhoods. After this, the data will look like below.
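The filtering step is a one-line boolean mask; a two-row stand-in for the full five-borough dataframe is used here so the cell runs standalone.

```python
import pandas as pd

# two-row stand-in for the full five-borough dataframe built earlier
neighborhoods = pd.DataFrame({
    'Borough': ['Staten Island', 'Manhattan'],
    'Neighborhood': ['St. George', 'Chelsea'],
    'Latitude': [40.6444, 40.7440],
    'Longitude': [-74.0765, -74.0030],
})

# keep only Staten Island rows and renumber the index from 0
SI_neighborhoods = neighborhoods[neighborhoods['Borough'] == 'Staten Island'].reset_index(drop=True)
print(SI_neighborhoods.shape)
```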
SI_neighborhoods.shape
Use the geopy library to get the latitude and longitude values of New York City.
address = 'New York City, NY'
geolocator = Nominatim(user_agent="nyc_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f"The geographical coordinates of New York City are {latitude}, {longitude}.")
Create a map of New York City with Staten Island neighborhoods superimposed on top
Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.
Define Foursquare Credentials and Version
CLIENT_ID = 'EPGMBXZ5TDS4J3CTQTAXIP5DSIAMYSAPWFZ5D2DBKJQU2XR0' # your Foursquare ID
CLIENT_SECRET = 'ECFCR3CNIGXJEYCWHZ1X2L30XJOJFNW3VIULS4W3N5LBULNY' # your Foursquare Secret
VERSION = '20200409' # Foursquare API version
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
Fetch Foursquare Venue Category Hierarchy
from pprint import pprint  # pretty-print nested JSON responses

url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION)
category_results = requests.get(url).json()
pprint(category_results)
category_list = category_results['response']['categories']
for data in category_list:
    print(data['id'], data['name'])
Note that the 'Food' category id is '4d4b7105d754a06374d81259'.
# Flatten a category hierarchy; if checkParentID is True, only descend into the
# category whose id matches parent_id, otherwise collect every category.
def flatten_Hierarchy(category_list, checkParentID, category_dict, parent_id=''):
    for data in category_list:
        if checkParentID and data['id'] == parent_id:
            category_dict[data['id']] = data['name']
            flatten_Hierarchy(category_list=data['categories'], checkParentID=False, category_dict=category_dict)
        elif not checkParentID:
            category_dict[data['id']] = data['name']
            if len(data['categories']) != 0:
                flatten_Hierarchy(category_list=data['categories'], checkParentID=False, category_dict=category_dict)
    return category_dict
Now we have all the categories under Food, with their IDs.
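The behaviour of the flattening step can be seen on a toy hierarchy. The cell below includes a local copy of `flatten_Hierarchy` and made-up ids so it runs standalone without an API call; against the real response, the call would pass `category_results['response']['categories']` and `parent_id='4d4b7105d754a06374d81259'`.

```python
# local copy of flatten_Hierarchy so this cell is self-contained
def flatten_Hierarchy(category_list, checkParentID, category_dict, parent_id=''):
    for data in category_list:
        if checkParentID and data['id'] == parent_id:
            category_dict[data['id']] = data['name']
            flatten_Hierarchy(data['categories'], False, category_dict)
        elif not checkParentID:
            category_dict[data['id']] = data['name']
            if data['categories']:
                flatten_Hierarchy(data['categories'], False, category_dict)
    return category_dict

# toy hierarchy (ids are made up); the real tree comes from the categories endpoint
sample_tree = [
    {'id': 'food', 'name': 'Food', 'categories': [
        {'id': 'pizza', 'name': 'Pizza Place', 'categories': []},
        {'id': 'deli', 'name': 'Deli / Bodega', 'categories': []},
    ]},
    {'id': 'shop', 'name': 'Shop & Service', 'categories': []},
]

# only 'Food' and its descendants are collected; 'Shop & Service' is skipped
food_dict = flatten_Hierarchy(sample_tree, checkParentID=True, category_dict={}, parent_id='food')
print(food_dict)
```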
Explore the first neighborhood of Staten Island to understand the results of the GET request.
Get the neighborhood's name.
SI_neighborhoods.loc[0, 'Neighborhood']
Get the neighborhood's latitude and longitude values.
neighborhood_latitude = SI_neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = SI_neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = SI_neighborhoods.loc[0, 'Neighborhood'] # neighborhood name
print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name,
neighborhood_latitude,
neighborhood_longitude))
Now, let's get the food venues in St. George within a radius of 400 meters.
First, let's create the GET request URL to search for venues with the requested category ID.
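The URL construction might look like the sketch below. The credential strings are placeholders standing in for the values defined earlier, and the St. George coordinates are approximate; the actual request line is left commented out so the cell does not hit the API.

```python
# placeholders standing in for the credentials cell above
CLIENT_ID = 'YOUR_CLIENT_ID'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'
VERSION = '20200409'

neighborhood_latitude, neighborhood_longitude = 40.6444, -74.0765  # approximate St. George coordinates
radius = 400  # meters
food_category_id = '4d4b7105d754a06374d81259'  # 'Food' category id noted earlier

url = ('https://api.foursquare.com/v2/venues/search'
       '?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}').format(
    CLIENT_ID, CLIENT_SECRET, VERSION,
    neighborhood_latitude, neighborhood_longitude, radius, food_category_id)
print(url)
# results = requests.get(url).json()  # uncomment to perform the actual request
```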
pprint(results['response']['venues'])
The name of the returned food category is 'Deli / Bodega'.
Create a function to repeat the above process to all the neighborhoods of Staten Island
def getNearbyFood(names, latitudes, longitudes, radius=400, LIMIT=100):
    not_found = 0
    print('***Start ', end='')
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(' .', end='')
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            "4d4b7105d754a06374d81259",  # "Food" category id
            LIMIT)
        try:
            # make the GET request
            results = requests.get(url).json()['response']['venues']
            # keep only the relevant information for each nearby venue
            venues_list.append([(
                name,
                lat,
                lng,
                v['name'],
                v['location']['lat'],
                v['location']['lng'],
                v['categories'][0]['name']) for v in results])
        except (KeyError, IndexError):
            not_found += 1
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    print(f"{not_found} venues with incomplete information.")
    return nearby_venues
To avoid redundant requests to the Foursquare API, use pickle to serialize the information retrieved from the GET requests.
import pickle # to serialize and deserialize a Python object structure
try:
    with open('SI_food_venues.pkl', 'rb') as f:
        SI_venues = pickle.load(f)
    print("---Dataframe Existed and Deserialized---")
except FileNotFoundError:
    SI_venues = getNearbyFood(names=SI_neighborhoods['Neighborhood'],
                              latitudes=SI_neighborhoods['Latitude'],
                              longitudes=SI_neighborhoods['Longitude'])
    with open('SI_food_venues.pkl', 'wb') as f:
        pickle.dump(SI_venues, f)
    print("---Dataframe Created and Serialized---")
SI_venues.shape
SI_venues.head()
Find how many unique categories can be curated from all the returned venues
Remove the generalized categories, like 'Restaurant' and 'Food'.
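Building the `food_categories` list used in the next cell might look like the sketch below. `food_dict` here is a small stand-in for the flattened Food hierarchy built earlier, and the category names are illustrative.

```python
# stand-in for the flattened Food hierarchy built earlier (id -> name)
food_dict = {
    'id1': 'Food',
    'id2': 'Restaurant',
    'id3': 'Pizza Place',
    'id4': 'Deli / Bodega',
}

# drop the generic labels, keeping only specific food categories
generic = {'Food', 'Restaurant'}
food_categories = [name for name in food_dict.values() if name not in generic]
print(food_categories)
```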
SI_venues = SI_venues[SI_venues['Venue Category'].isin(food_categories)].reset_index()
print(SI_venues.shape)
SI_venues.head(5)
Analyze Each Neighborhood
# one hot encoding
SI_onehot = pd.get_dummies(SI_venues[['Venue Category']], prefix="", prefix_sep="")
SI_onehot.head()
# add neighborhood column back to dataframe
SI_onehot['Neighborhood'] = SI_venues['Neighborhood']
SI_onehot.head()
# move neighborhood column to the first column
neighborhood = SI_onehot['Neighborhood']
SI_onehot.drop(labels=['Neighborhood'], axis=1,inplace = True)
SI_onehot.insert(0, 'Neighborhood', neighborhood)
SI_onehot.head()
# count venues of each category in each neighborhood
venue_counts = SI_onehot.groupby('Neighborhood').sum()
venue_counts.head(5)
Find the top 10 food categories in Staten Island, NYC
venue_counts_described = venue_counts.describe().transpose()
venue_top10 = venue_counts_described.sort_values('max', ascending=False)[0:10]
venue_top10
Plot the top 10 food categories
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

venue_top10_list = venue_top10.index.values.tolist()
fig, axes = plt.subplots(5, 2, figsize=(20, 20), sharex=True)
axes = axes.flatten()
for ax, category in zip(axes, venue_top10_list):
    data = venue_counts[[category]].sort_values([category], ascending=False)[0:10]
    pal = sns.color_palette("Blues", len(data))
    sns.barplot(x=category, y=data.index, data=data, ax=ax, palette=pal[::-1])
plt.tight_layout()
plt.show()
Group rows by neighborhood, taking the mean of the frequency of occurrence of each category.
SI_grouped = SI_onehot.groupby('Neighborhood').mean().reset_index()
SI_grouped.head()
# function to sort the venues of a row in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
Create the new dataframe and display the top 5 venues for each neighborhood.
num_top_venues = 5
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind + 1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = SI_grouped['Neighborhood']
for ind in np.arange(SI_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(SI_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head()
Run k-means to cluster the neighborhoods; first drop the non-numeric Neighborhood column so only category frequencies remain.
SI_grouped_clustering = SI_grouped.drop('Neighborhood', axis=1)
Determine the optimal number of clusters for k-means clustering
The Elbow Method - calculate the sum of squared distances of samples to their closest cluster center for different values of k. The value of k after which there is no significant decrease in the sum of squared distances is chosen.
from sklearn.cluster import KMeans

sum_of_squared_distances = []
K = range(1, 50)
for k in K:
    kmeans = KMeans(n_clusters=k).fit(SI_grouped_clustering)
    sum_of_squared_distances.append(kmeans.inertia_)
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k value')
plt.ylabel('sum of squared distances')
plt.title('Elbow Method For Optimal k value');
There seems to be a slight bend at k = 9, but confirm this with the Silhouette Method.
from sklearn.metrics import silhouette_score
sil = []
K_sil = range(2, 50)  # at least 2 clusters are required to define dissimilarity
for k in K_sil:
    print(k, end=' ')
    kmeans = KMeans(n_clusters=k).fit(SI_grouped_clustering)
    labels = kmeans.labels_
    sil.append(silhouette_score(SI_grouped_clustering, labels, metric='euclidean'))
plt.plot(K_sil, sil, 'bx-')
plt.xlabel('k value')
plt.ylabel('silhouette score')
plt.title('Silhouette Method For Optimal k value')
plt.show()
There are peaks at k = 2, 4, 6, and 9. Since the Elbow Method also suggested 9 as the best value, use 9 clusters.
kclusters = 9
from collections import Counter

# run k-means clustering
kmeans = KMeans(init="k-means++", n_clusters=kclusters, n_init=50).fit(SI_grouped_clustering)
print(Counter(kmeans.labels_))
Create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.
# add clustering labels (only once, in case the cell is re-run)
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted.head(5)
# merge neighborhoods_venues_sorted with SI_data to add latitude/longitude for each neighborhood
SI_merged = neighborhoods_venues_sorted.join(SI_neighborhoods.set_index('Neighborhood'), on='Neighborhood')
SI_merged.head()
Visualize the resulting clusters
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, one color per cluster label (labels are 0-based)
for lat, lon, poi, cluster in zip(SI_merged['Latitude'], SI_merged['Longitude'], SI_merged['Neighborhood'], SI_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
cluster_0 = SI_merged.loc[SI_merged['Cluster Labels'] == 0, SI_merged.columns[1:10]]
cluster_0.head(5)
required_column_indices = [2,1]
required_column = [list(SI_merged.columns.values)[i] for i in required_column_indices]
required_column
separator = '*'*50
for col in required_column:
    print(cluster_0[col].value_counts(ascending=False))
    print(separator)
# repeat the same inspection for the remaining clusters
for k in range(1, kclusters):
    cluster_k = SI_merged.loc[SI_merged['Cluster Labels'] == k, SI_merged.columns[1:10]]
    print(f"--- Cluster {k} ---")
    print(cluster_k.head(5))
    for col in required_column:
        print(cluster_k[col].value_counts(ascending=False))
        print(separator)